TOPOLOGICAL STRUCTURES OF INFORMATION RETRIEVAL SYSTEMS


This paper considers the problem of information retrieval from the point of view of graph theory. In this formulation documents are represented as nodes and relationships among the documents are represented by edges. Two types of graphs are introduced, namely the similarity graph, which is based on subject-content correlation, and the citation graph, which is derived from direct citation linkages among documents. Several distance measures are considered and evaluated with regard to retrieval operations.

I.  Introduction 

Within the scope of this paper we shall consider an information
retrieval  system  to  consist  of  two  major  components,  namely,  a  document 
collection  and  a  retrieval  procedure,  that  is,  a  systematic  way  of  selecting 
a  subset  of  documents  of  the  collection  according  to  a  given  criterion. 

The documents in the collection are coupled to one another in many different respects, such as subject content, form, authorship, citations, etc. Two of these facets, namely subject content and citations, have been exploited for application in retrieval.

In a great many modern information retrieval systems the characteristics in subject content are expressed in terms of subject descriptors. Attached to each document is a set of subject descriptors which characterizes the subject content of the document. A measure of the similarity between a pair of documents can then be obtained by comparing their assigned descriptors. Characterization of documents through the use of subject descriptors is known as coordinate indexing.

In a retrieval operation a query is presented to the system which describes a profile of the type of documents to be retrieved from the collection. In most systems employing coordinate indexing today the query is given in terms of a set of descriptors or some logical function thereof. For instance, we may ask for all documents that deal with the "decoding" of "Bose-Chaudhuri-Hocquenghem Codes" that are published in the "Transactions of IEEE on Information Theory" since "1964," where the terms in quotation marks are descriptors.
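As a minimal sketch of such a descriptor query, assume each document is stored simply as a set of descriptor strings and the query is the logical AND of its descriptors. The document identifiers, the descriptor sets and the function name `conjunctive_query` are invented for illustration and do not come from any actual system.

```python
# Hypothetical toy collection: document id -> set of assigned descriptors.
collection = {
    "doc1": {"decoding", "Bose-Chaudhuri-Hocquenghem Codes", "1965"},
    "doc2": {"decoding", "convolutional codes"},
    "doc3": {"encoding", "Bose-Chaudhuri-Hocquenghem Codes"},
}

def conjunctive_query(collection, required):
    """Return the ids of the documents carrying every descriptor in `required`."""
    required = set(required)
    return sorted(doc for doc, descriptors in collection.items()
                  if required <= descriptors)

hits = conjunctive_query(collection,
                         ["decoding", "Bose-Chaudhuri-Hocquenghem Codes"])
```

Other logical functions of the descriptors (OR, NOT) could be expressed in the same way with set union and set difference.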

Another type of retrieval system is based on citation indexing. In this type of system, citation information among documents is stored in the system. The query is given by specifying accession documents in
the  network.  For  instance,  one  might  wish  to  retrieve  all  documents  citing 
a  document  d  or  one  might  wish  to  retrieve  all  documents  that  are  cited  by 
document  d.  Retrieval  operations  based  on  multi-generation  citations  are 
theoretically  feasible  but  so  far  have  not  received  much  attention. 
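The citation queries just described can be sketched as follows, assuming the citation information is held as a mapping from each citing document to the documents it cites. The sample records and the function names are hypothetical; `cited_within` illustrates what a multi-generation query might look like.

```python
# Hypothetical citation records: citing document -> documents it cites.
cites = {
    "d1": ["d2", "d3"],
    "d2": ["d3"],
    "d4": ["d1", "d3"],
}

def cited_by(cites, d):
    """Documents that are cited by document d."""
    return sorted(cites.get(d, []))

def citing(cites, d):
    """Documents that cite document d."""
    return sorted(src for src, targets in cites.items() if d in targets)

def cited_within(cites, d, generations):
    """Documents reached from d by following at most `generations` citations."""
    frontier, reached = {d}, set()
    for _ in range(generations):
        # Follow one more generation of citations from the current frontier.
        frontier = {t for src in frontier for t in cites.get(src, [])} - reached
        reached |= frontier
    return sorted(reached - {d})
```

With one generation `cited_within` reduces to `cited_by`; larger values widen the search at the cost of a rapidly growing result set.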

In comparing the two popular schemes, citation indexing is easy to instrument but is limited in scope in that it derives information only from existing direct linkages in the document collection. This restriction is reflected in the usual incompleteness of retrieval results when one is interested in searches based on subject content.

On the other hand, coordinate indexing works well only if the indexed document collection is relatively homogeneous and the query well-defined. For requests from research scientists the query is always aimed at the intersection or the union of several narrow and ill-defined disciplines. As a result, the outcome is usually contaminated with large amounts of irrelevant material.

Aiming at retrieval procedures that will produce sharper and more complete responses, we propose the study of potential systems that combine the resources of both the coordinate-indexing approach and the citation methods. To minimize the inconsistency between indexing and retrieval we choose to represent all queries in terms of documents. To state it formally, the problem treated in this paper is one of finding an information retrieval system that combines the advantages of both coordinate indexing and citation indexing. A typical retrieval operation would be the retrieval of a set of documents that is "close" in some reasonable measure to a given document profile. To facilitate instrumentation, emphasis is placed on easily implemented systems.

II. The Correlation Graph

The  main  consideration  in  this  section  will  be  document  couplings  that 
are  subject-content  based.  Although  a  number  of  studies  have  been  made  in 
this  area  involving  fairly  complicated  couplings  and  their  interactions, 
the  type  of  couplings  to  be  investigated  here  will  be  relatively  simple  in 
nature  as  our  chief  objective  dwells  on  the  question  of  optimum  combination 
of  subject-content  based  indexing  and  non-subject-content  based  indexing. 

Let  us  consider  a  coordinate  indexing  scheme  in  which  each  document 
is  assigned  a  number  of  descriptors.  For  a  typical  system  the  total  number 
of  descriptors  will  be  of  the  order  of  10,000  while  each  document  may  be 
assigned  ten  to  fifteen  descriptors  on  the  average.  A  typical  curve  for 
descriptor  frequency  is  given  in  Figure  1.  The  behavior  of  the  curve 
sketched in Figure 1 can be explained as follows. It is observed that typically there are two kinds of descriptors. Descriptors of the first kind may be termed general descriptors and have a high probability of being used for many documents. Descriptors of the second kind are specialized in nature and have a low probability of being used but provide the system with a tremendous amount of selectivity whenever they are present.

[Figure 1. Descriptor Frequency Distribution (axis label: descriptors)]

The  dichotomy  of  the  descriptor  population  points  up  the  difficulty 
in  indexing  resolution.  In  the  interest  of  efficiency  it  is  necessary  to 
keep  the  number  of  descriptors,  especially  descriptors  of  the  general  type, 
small.  The  thesaurus  of  any  practical  system  is  therefore  usually  the 
result of compromises. While the initial resolution may be adequate for the initial collection and subject to most queries, the system may not perform satisfactorily when the document collection grows or when the query cannot be defined clearly with the system's limited vocabulary.

Let us consider the document-descriptor matrix $A$, which has $m$ rows and $n$ columns. With each row of $A$ is associated a document and with each column a descriptor. The entry $a_{ij}$ takes the value one if the $j$th descriptor is assigned to the $i$th document and zero otherwise. We define the $m \times m$ correlation matrix as

\[ C = \|c_{ij}\| = A A^{T} . \]

The correlation graph is defined by the following process. We assign each document a node and assign the value $c_{ij}$ as the weight of the link between nodes $i$ and $j$. Thus the weight $c_{ij}$ of the link in the correlation graph serves as a measure of "closeness" between documents $i$ and $j$.

It is noted that the number of rows of $C$ is equal to the number of documents, $m$, in the document collection. This is usually a large number. To compute $AA^{T}$ in the conventional way of matrix computation would not be an attractive approach. Since the number of descriptors assigned to each individual document is small, the density of nonzero entries $a_{ij}$ in $A$ is very low. The computation of $C = AA^{T}$ can then be done efficiently by list-processing techniques. A detailed discussion of the technique will be given in conjunction with the analysis of the citation graph in the next section.
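Since $c_{ij} = \sum_k a_{ik} a_{jk}$ is simply the number of descriptors shared by documents $i$ and $j$, one possible list-processing computation works from inverted lists (descriptor to documents) rather than from the full matrix. The sketch below is only an illustration under that assumption; the paper does not spell out its technique at this point, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def correlation_links(doc_descriptors):
    """Return {(i, j): c_ij} for i < j with c_ij > 0, where c_ij counts the
    descriptors shared by documents i and j (an off-diagonal entry of A A^T)."""
    # Inverted lists: descriptor -> documents to which it is assigned.
    postings = defaultdict(list)
    for doc, descriptors in doc_descriptors.items():
        for descriptor in set(descriptors):
            postings[descriptor].append(doc)
    # Each descriptor adds 1 to c_ij for every pair of documents sharing it.
    weights = defaultdict(int)
    for docs in postings.values():
        for i, j in combinations(sorted(docs), 2):
            weights[(i, j)] += 1
    return dict(weights)

# Hypothetical four-document collection.
docs = {
    1: ["coding", "decoding"],
    2: ["decoding", "graphs"],
    3: ["coding", "decoding", "graphs"],
    4: ["optics"],
}
links = correlation_links(docs)
```

Only pairs with at least one common descriptor are ever touched, so the zero entries of $C$ cost nothing. A very general descriptor with a long inverted list still generates many pairs, which is one practical reason to keep the number of general descriptors small.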


III.  The  Citation  Graph 

Another class of structural organizations of a given collection of documents can be obtained by exploiting the bibliographic couplings. Several types of bibliographic couplings may be envisaged, such as those based on the number of shared references, citation, weighted citation, etc. Obviously, the simplest type of coupling is provided by direct citation, which may be considered as a first order association of documents. In this scheme, with each document we associate a set of documents, i.e. the documents it cites. Citation is interpreted as a directed relation between citing and cited: if we represent documents with nodes, citation can be adequately represented by directed edges from the citing document to the cited documents. We perform this representation for each document in the collection and the citation graph is constructed.

Formally, given a document collection $B = \{d_1, \dots, d_n\}$ consisting of documents $d_1, d_2, \dots, d_n$, the directed citation graph pertaining to $B$ is entirely described by an $n \times n$ matrix $E = \|e_{ij}\|$, where $e_{ij} > 0$ if and only if document $d_i$ cites document $d_j$.

As  noted,  citation  indicates  an  association  between  documents  and 


could  be  conveniently  exploited  in  retrieval  operations.  Specifically,  the 
citation  structure  may  be  particularly  useful  when  the  query  is  formulated 
by  specifying  a  non-empty  set  of  documents  Q  and  the  retrieval  goal  is  the 
extraction  of  a  set  R  of  documents  (R  ^  Q)  which  are  subject-related  to  the 
documents of Q. In the simplest instance, $Q = \{d_i\}$, i.e. it contains a single document $d_i$; $d_i$ is denoted as the access point.

The determination of the retrieved set $R$ could be conveniently performed in a mechanical fashion through the evaluation of some single-valued distance function defined between each pair of nodes of the graph.

Before analyzing the prerequisites of a distance function, we reconsider the directed citation graph. If we take citation as a sign of subject-relation, we see that for the purpose of defining subject-areas the direction of citation loses its importance. This leads us to replacing the directed graph with the undirected graph $U$, simply denoted as the citation graph. $U$ is described by the $n \times n$ matrix

\[ T = \|t_{ij}\| \]

where now $t_{ij} = t_{ji} > 0$ means that $d_i$ and $d_j$ are linked through direct citation. The weight of the linkage, $t_{ij}$, may be binary-valued (0,1) if we are simply interested in the presence or absence of citation. In more refined schemes it could be real-valued non-negative, its magnitude measuring the strength of coupling in a normalized interval (0,1).
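The passage from the directed citation records to the undirected graph $U$ can be sketched as below, assuming binary weights ($t_{ij} = t_{ji} = 1$ whenever either document cites the other) and storing $U$ as ordered neighbor lists rather than as a full matrix. The function name and the sample data are hypothetical.

```python
def citation_lists(directed_citations):
    """Build the undirected citation graph U as ordered neighbor lists.

    `directed_citations` maps each citing document to the documents it cites.
    Binary weights are assumed: a link exists when either document cites
    the other, whatever the direction."""
    neighbors = {}
    for citing_doc, cited_docs in directed_citations.items():
        neighbors.setdefault(citing_doc, set())
        for cited in cited_docs:
            if cited == citing_doc:
                continue  # ignore self-citation records
            neighbors[citing_doc].add(cited)
            neighbors.setdefault(cited, set()).add(citing_doc)
    return {doc: sorted(linked) for doc, linked in neighbors.items()}

# Hypothetical directed records: 1 cites 2 and 3, 2 cites 3, 4 cites 3.
E = {1: [2, 3], 2: [3], 4: [3]}
U = citation_lists(E)
```

Each list is the set of nonzero column positions of one row of $T$, which is all that a sparse computation needs.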

We now make an attempt to formulate some properties which seem to be desirable for a distance function $f_{ij}$ defined for every pair of nodes $d_i$, $d_j$ of the graph $U$: obviously $f_{ij}$ must provide an intuitively satisfactory measure of connectivity.

First, suppose that a procedure has been given for the computation of $f_{ij}$. It seems reasonable to require that, if the coupling strength $t_{hk}$ between two generic documents $d_h$ and $d_k$ is increased (i.e. $t_{hk}$ is a continuous parameter), the distance between any two distinct documents $d_i$, $d_j$ cannot increase. Formally, in the hypothesis that coupling strengths are continuous parameters, $\partial f_{ij} / \partial t_{hk}$ must be continuous and for $t_{hk} > 0$, $f_{ij} > 0$ we must have

\[ \frac{\partial f_{ij}}{\partial t_{hk}} \le 0 \tag{1} \]

i.e. $f_{ij}$ is a monotonically non-increasing function of the $t_{hk}$'s.

Secondly, assume that two documents $d_i$ and $d_j$ are linked exclusively through a third document $d_k$, i.e. that each and every path $P_{ij}$ between $d_i$ and $d_j$ contains $d_k$. In this case, it seems natural to require that the distance function $f_{ij}$ be additive, or

\[ f_{ij} = f_{ik} + f_{kj} . \tag{2} \]


We must point out, at this stage, that we are aiming less at a semantic similarity between documents than at some easily and mechanically computable correlation based on the citation association.


Returning now to our main line, we notice that the well-known function "resistance" defined over the graph $U$ would meet our previous requirements (1), (2). The graph $U$ is considered as a resistive network, in which each edge $b_{hk}$ is assigned a resistance $1/t_{hk}$. Since the resistance $R_{ij}$ between any two nodes $d_i$, $d_j$ of $U$ is well-defined we could let

\[ f_{ij} = R_{ij} . \]

In addition to verifying (1) and (2), $R_{ij}$ is also a metric function.

Another well-known function which could be adopted as a measure of distance is the "reliability" between pairs of nodes. We recall that reliability $r_{ij}$ between $d_i$ and $d_j$ is the probability of establishing a transmission path between $d_i$ and $d_j$ if $t_{hk}$ is the probability of correct functioning for the edge $b_{hk}$. It is easy to recognize that both requirements (1) and (2) are verified by $r_{ij}$.
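To make the resistance function concrete, the sketch below computes $R_{ij}$ by ordinary nodal analysis: node $d_j$ is grounded, a unit current is injected at $d_i$, and the resulting potential of $d_i$ is the resistance. This is a standard circuit-theory computation chosen here for illustration, not a procedure taken from the paper; it assumes a connected network, and the edge-dictionary representation and function name are hypothetical.

```python
from fractions import Fraction

def resistance(t, i, j):
    """Effective resistance between nodes i and j of a connected network.

    `t` maps an undirected edge (h, k) to its coupling strength t_hk > 0;
    the edge has resistance 1/t_hk, i.e. conductance t_hk."""
    if i == j:
        return Fraction(0)
    nodes = sorted({n for edge in t for n in edge} - {j})  # node j is grounded
    index = {n: p for p, n in enumerate(nodes)}
    size = len(nodes)
    # Augmented reduced Laplacian [L | b]; b injects a unit current at node i.
    rows = [[Fraction(0)] * (size + 1) for _ in range(size)]
    for (h, k), strength in t.items():
        g = Fraction(strength)
        for a, b in ((h, k), (k, h)):
            if a in index:
                rows[index[a]][index[a]] += g
                if b in index:
                    rows[index[a]][index[b]] -= g
    rows[index[i]][size] = Fraction(1)
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return rows[index[i]][size]  # potential of node i under unit current

# Two unit edges in series, and a triangle of unit edges.
path = {(1, 2): 1, (2, 3): 1}
triangle = {(1, 2): 1, (2, 3): 1, (1, 3): 1}
```

On `path`, node 2 lies on every path between nodes 1 and 3 and the resistances add, as requirement (2) asks. The cost of one linear solve per access node illustrates why a simpler function is sought in what follows.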

A number of topological techniques are known for the evaluation of either the resistance function or the reliability function. These techniques are satisfactory for most applications. In computer-based information retrieval systems, however, the procedure must be applied many times for each retrieval operation and simplicity in the methods employed is of utmost importance.

For this reason, we turn our attention to another function which can be defined for each pair of nodes of $U$. We recall that a circuit is a set of $m$ undirected edges $b_1, b_2, \dots, b_m$, such that: i) each $b_h$ can be oriented; ii) the terminal of $b_h$ coincides with the origin of $b_{h+1}$; iii) the terminal of $b_m$ coincides with the origin of $b_1$. Obviously a circuit $G_{ij}$ containing $d_i$ and $d_j$ is composed of two paths which are edge-disjoint (but not necessarily node-disjoint). We can now give the following

Definition: Let $G_{ij}^{(1)}, G_{ij}^{(2)}, \dots, G_{ij}^{(n)}$ be the set of distinct circuits containing two distinct nodes $d_i$ and $d_j$. We define as the length $\ell[G_{ij}^{(s)}]$ of the circuit $G_{ij}^{(s)}$ $(s = 1, 2, \dots, n)$ the sum of $1/t_{hk}$ over each edge $b_{hk}$ belonging to $G_{ij}^{(s)}$. Then we let

\[ f_{ij} = \min_{s} \ell[G_{ij}^{(s)}] . \tag{3} \]

We note that $f_{ij}$ satisfies requirements (1) and (2). In fact, if $t_{hk}$ is the weight of edge $b_{hk}$ and $G_{ij}$ is a minimum length circuit, then

\[ f_{ij} = \sum_{b_{hk} \in G_{ij}} \frac{1}{t_{hk}} . \]

It follows that

\[ \frac{\partial f_{ij}}{\partial t_{hk}} = \begin{cases} 0 & \text{if } b_{hk} \notin G_{ij} \\ -\dfrac{1}{t_{hk}^{2}} < 0 & \text{if } b_{hk} \in G_{ij} \end{cases} \]

By letting $f_{ii} = 0$ for each $i$, verification of property (2) follows from the stronger statement that $f_{ij}$, as given by (3), is a metric function. The proof of this assertion is considerably simplified by the following lemma.

Lemma: If there is a circuit $G_1$ containing $d_1$ and $d_2$ and a circuit $G_2$ containing $d_2$ and $d_3$, then there exists a circuit containing $d_1$ and $d_3$.

Proof: Let $G_1$ consist of the two edge-disjoint paths $P_1$, $P_2$ and similarly $G_2$ consist of $P_3$, $P_4$. Since $P_1 \cap G_2$ is non-empty (at least they contain node $d_2$), starting from $d_1$ and proceeding on $P_1$, let $d_1^{*}$ be the first node of $P_1$ which also belongs to $G_2$. Similarly, let $d_2^{*}$ be the analogous node on $P_2$. We have now the following two situations:

1) $d_1^{*}$, $d_2^{*}$ belong to the same path of $G_2$, say $P_3$. Then traversing $P_3$ from $d_3$ to $d_2$, assume, with no loss of generality, that we first reach $d_1^{*}$ (if $d_1^{*} = d_2^{*}$, it is immaterial which $d_j^{*}$ $(j = 1, 2)$ is chosen as the first node reached). Path $P_3$ is therefore partitioned into paths $d_3 P_3 d_1^{*}$, $d_1^{*} P_3 d_2^{*}$, $d_2^{*} P_3 d_2$, with $d_1^{*} P_3 d_2^{*}$ possibly empty. We then form the following paths $P_1^{*}$, $P_2^{*}$:

\[ P_1^{*} = d_1 P_1 d_1^{*} P_3 d_3 , \qquad P_2^{*} = d_1 P_2 d_2^{*} P_3 d_2 P_4 d_3 . \]

We claim that $G^{*} = P_1^{*} \cup P_2^{*}$ is a circuit. In fact the path $d_1 P_1 d_1^{*}$ is edge-disjoint from $d_1 P_2 d_2^{*}$ by hypothesis and from $d_2^{*} P_3 d_2 P_4 d_3$ by construction (since $d_1 P_1 d_1^{*}$ contains no edge of $G_2$). Similarly $d_1^{*} P_3 d_3$ is edge-disjoint from $d_2^{*} P_3 d_2 P_4 d_3$ by hypothesis and from $d_1 P_2 d_2^{*}$ by construction (since the latter contains no edge of $G_2$).

2) $d_1^{*}$, $d_2^{*}$ belong to different paths of $G_2$. Assume $d_1^{*} \in P_3$ and $d_2^{*} \in P_4$. Then we form the two paths

\[ P_1^{*} = d_1 P_1 d_1^{*} P_3 d_3 , \qquad P_2^{*} = d_1 P_2 d_2^{*} P_4 d_3 \]

and argue as in case 1.

Q.E.D.

We see therefore that $f_{ij}$, as given by (3), is real-valued, satisfies the reflexive property by definition and the symmetric property because of the undirectedness of $U$. The triangle inequality follows from the Lemma, since, with the same symbols, $G^{*}$ consists of a subset (proper or improper) of the edges of $G_1 \cup G_2$. Hence

\[ \ell[G^{*}] \le \ell[G_1] + \ell[G_2] \]

and the inequality holds also when $G_1$ and $G_2$ are of minimal length. We have therefore proved

Theorem: The function $f_{ij}$ (3) is a metric function.

In addition to another reason which we shall mention later, an interesting feature of function (3) is the relative ease with which it can be mechanically computed.

A string $S$ is a sequence over the set of symbols (integers) $1, 2, \dots, n$. Over the set of strings we define the operation of a string product: the string product of $S_1$ and $S_2$ is their concatenation $S_1 S_2$. Clearly, the string product is associative but not commutative. With the symbol 0 we denote the zero string, i.e., the string of no symbols. By definition, for every $S$, $0 \cdot S = S \cdot 0 = 0$. Further a string product $S$ is 0 in the following circumstances (nullification rules):

Rule i) $S$ is of the form ...hk...hk... or ...hk...kh... (i.e. a given pair of consecutive symbols is repeated either in the same order or in reversed order).

Rule ii) $S$ is of the form h...h (i.e. the first and the last symbols of $S$ coincide).

Given these definitions, we construct the matrix $M$, obtained from $T$ by replacing each $t_{hk} > 0$ with the integer $k$, which is now regarded as a symbol in the sense specified above.
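The string product and its nullification rules can be sketched as follows. The representation is an assumption made for illustration: strings are tuples of integers, the zero string is the sentinel `ZERO`, rule ii) is applied only to strings of at least two symbols, and a pattern such as h k h is treated as a repeated pair under rule i).

```python
ZERO = None  # stands for the zero string "0" (representation is an assumption)

def nullified(symbols):
    """True if the sequence of symbols falls under nullification rule i) or ii)."""
    # Rule ii): the first and the last symbols coincide.
    if len(symbols) > 1 and symbols[0] == symbols[-1]:
        return True
    # Rule i): a pair of consecutive symbols recurs, in either order.
    seen = set()
    for h, k in zip(symbols, symbols[1:]):
        pair = frozenset((h, k))
        if pair in seen:
            return True
        seen.add(pair)
    return False

def string_product(s1, s2):
    """Concatenate two strings (tuples of integers), applying the zero rules."""
    if s1 is ZERO or s2 is ZERO:
        return ZERO
    product = s1 + s2
    return ZERO if nullified(product) else product
```

For example, `string_product((1, 2), (3,))` yields `(1, 2, 3)`, while `string_product((1, 2, 3), (2,))` is nullified because the pair 2 3 recurs as 3 2.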

Assume now, for simplicity, that we aim to compute the distance with respect to $d_1$. We multiply the first row $u^{(1)}$ of $M$ by $M$ and replace the ordinary operation of multiplication with the just defined string product. We obtain the vector

\[ u^{(2)} = u^{(1)} M . \]

We iterate this operation $s-1$ times and obtain

\[ u^{(s)} = u^{(s-1)} M . \]

Let us analyze $u^{(s)}$ for $s \ge 3$. Its first component, which is then conventionally set to 0 (rule ii), gives a collection of circuits containing $d_1$ and composed of $s$ edges: in fact rules i), ii) of nullification of the string product ensure us that no edge is traversed more than once. By this iterative procedure we can obtain all circuits containing $d_1$ with up to $s$ edges.

The computation of the distance becomes trivial in the particular case in which all edges are equally weighted, e.g. $t_{hk} = 1$ for any existing edge. In this case the distance is simply the number of edges of the shortest circuit containing the access node and the node under consideration. We can therefore give the following computer-oriented algorithm for the search of all documents up to distance $s$ from a specified document, where $s$ is used as a control parameter. The algorithm takes advantage of the fact that the $T$ matrix is in effect very sparse: while its order could be around several tens of thousands, the number of non-zero entries per row (the degree of the node) is, on the average, close to 10.

Algorithm. Each document $d_i \in B$ is specified, for simplicity, through its accession number $i$. With each $i$ we associate a list $L_i$, i.e. a collection of integers which are the accession numbers of the documents directly linked through citation with $i$: the integers belonging to $L_i$ are assumed to be naturally ordered.

Let $i$ be the document specified by the query, i.e. the access point. With $L$ we designate the current list: each term of $L$ is, in general, a sum of all the string products having equal last symbol; the terms are ordered by increasing last symbol.

1. Set $r = 2$. Let $L = L_i$.

2. Let $a_1, a_2, \dots, a_{n_r}$ be the last symbols of the terms of $L$. Set $j = 1$.

3. Call from the archive list $L_{a_j}$ and form the string product of the term ending with $a_j$ by each term of $L_{a_j}$. If $j < n_r$, replace $j$ with $j+1$ and repeat step 3; if $j = n_r$ go to 4).

4. Sort all string products obtained in iterations of step 3 by increasing last symbol: form new terms by adding all string products with equal last symbol. For $r > 2$, the term ending with $i$ provides all circuits of length $r$.

5. Apply nullification rules i) and ii) on the list obtained in step 4. The resulting list is the new $L$. If $r = s$, the algorithm terminates. If $r < s$, replace $r$ with $r+1$ and return to step 2.

The described algorithm provides all circuits containing the access node and having up to $s$ edges: the actual computation of the distance requires no further comment.
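The following is a minimal sketch of the algorithm for equally weighted edges, written under stated assumptions rather than as a definitive implementation. Each string is kept as a tuple that carries the access symbol explicitly as its first symbol, so that rule i) also catches a repeated use of the first edge and rule ii) nullifies a string once it has returned to the access node. Strings are kept in a flat list instead of being sorted and merged by last symbol. The function names and the sample graph are hypothetical.

```python
def circuits_through(adj, access, s):
    """All circuits through `access` with at most s edges, each as a tuple of
    nodes beginning and ending with `access` (both orientations are listed).
    `adj` maps each accession number to the ordered list L_i of its neighbors."""
    def repeats_edge(string):
        # Rule i): some undirected edge is traversed more than once.
        seen = set()
        for a, b in zip(string, string[1:]):
            edge = frozenset((a, b))
            if edge in seen:
                return True
            seen.add(edge)
        return False

    current = [(access, k) for k in adj[access]]  # strings with one edge
    circuits = []
    for _ in range(2, s + 1):
        # Step 3: extend every string by each neighbor of its last symbol.
        products = [string + (k,) for string in current
                    for k in adj[string[-1]]]
        products = [p for p in products if not repeats_edge(p)]
        # Step 4: strings ending with the access symbol are circuits.
        circuits.extend(p for p in products if p[-1] == access)
        # Step 5, rule ii): closed strings are not extended further.
        current = [p for p in products if p[-1] != access]
    return circuits

def circuit_distances(adj, access, s):
    """Distance (3) with unit weights: edges of the shortest circuit containing
    `access` and each other node, for circuits of at most s edges."""
    dist = {}
    for circuit in circuits_through(adj, access, s):
        length = len(circuit) - 1
        for node in circuit[1:-1]:
            if node not in dist or length < dist[node]:
                dist[node] = length
    return dist

# Hypothetical graph: a triangle 1-2-3 with a pendant document 4 linked to 3.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
distances = circuit_distances(adj, 1, 4)
```

In this example documents 2 and 3 are at distance 3 from the access document 1, while document 4, which is joined to the rest by a single link, lies on no circuit and is therefore not retrieved.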

We must not overlook the possible objection that however simple the previous algorithm may appear, the length of the current list $L$ may reach extremely high values for sufficiently high $s$. This geometric explosion with ratio equal to the average degree of the nodes would certainly take place if document links were assigned at random. In our case, however, it appears that the structure of the citation network, through the strong interconnection of documents in a given subject area, acts in favor of a much milder increase: simple manual trials appear to confirm this intuition, but only more extensive experiments can have probative value.

Another  promising  feature  of  the  circuit  concept  is  related  to  the 
remark  that  possibly  irrelevant  documents,  relatively  close  through  citation 
to  the  access  document,  are  excluded  from  the  retrieved  set  R:  the 
intuition,  in  fact,  would  suggest  that  if  there  is  only  one  path  from  the 
access  node  to  the  node  representative  of  a  given  document,  the  latter  is 
most  likely  not  subject-related  to  the  query. 

IV.  Schemes  for  Combined  Retrieval 

In  the  two  previous  sections  we  have  analysed  the  correlation  graph 
and  the  citation  graph  as  two  structural  organisations  which  can  be 
conveniently  exploited  for  document  retrieval.  As  mentioned  in  the 
introduction, it seems very attractive to combine the power of the two
structures  in  order  to  mitigate  their  respective  shortcomings,  i.e.  the 
disturbance  or  "noise"  caused,  for  example,  by  homographs  in  coordinate 
indexing  or  by  careless  citation. 


If the query is specified by a single document (and there seems to be no conceptual difficulty in passing from single to composite queries), by following the criteria presented in Sections II and III, we can compute two distances of each document $d_j$ from the query $d_i$: i.e. $f_{ij}^{(1)}$, as obtained from the correlation graph, and $f_{ij}^{(2)}$, as obtained from the citation graph. The combined distance $f_{ij}$ must very reasonably be an increasing function of $f_{ij}^{(1)}$ and $f_{ij}^{(2)}$. The two simplest expressions of $f_{ij}$ which we propose are

\[ f_{ij} = a_1 f_{ij}^{(1)} + b_1 f_{ij}^{(2)} \tag{4} \]

\[ \ln f_{ij} = a_2 \ln f_{ij}^{(1)} + b_2 \ln f_{ij}^{(2)} \tag{5} \]

where $a_1, a_2, b_1, b_2$ are positive constants. We remark that function (4) corresponds to the set theoretical operation of union when applied to the two graphs, while (5) corresponds to the set theoretical operation of intersection.
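Expressions (4) and (5) can be written down directly, as in the sketch below. The default value 1.0 for each constant is only a placeholder chosen for illustration, and the function names are hypothetical. Expression (5) requires both distances to be strictly positive.

```python
import math

def combined_linear(f1, f2, a1=1.0, b1=1.0):
    """Expression (4): f = a1*f1 + b1*f2 (a1, b1 are placeholder constants)."""
    return a1 * f1 + b1 * f2

def combined_log(f1, f2, a2=1.0, b2=1.0):
    """Expression (5): ln f = a2*ln f1 + b2*ln f2, i.e. f = f1**a2 * f2**b2.
    Both distances must be strictly positive."""
    return math.exp(a2 * math.log(f1) + b2 * math.log(f2))
```

Written in product form, (5) makes the combined distance small as soon as either component distance is small, whereas under (4) both components must be small.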

No insight has so far been obtained into the possible values of the constants $a_1, a_2, b_1, b_2$. An extensive experiment has been planned which should shed light on this aspect of the proposed scheme, as well as on further theoretical developments.